Let's begin by listing the dataset's variables.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Next, I need more detail about the types of the variables in the dataset.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
I can see that this dataset has 1599 observations of 13 variables. All the variables are of type num except X and quality, which are of type int.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
This summary shows that X is just an observation number or identifier, so it has no bearing on the quality of the red wine and we can ignore it.
Quality is an ordered, discrete variable.
At least 75% of the red wines have a quality of 6 or less (the third quartile is 6).
The other variables are continuous.
The median fixed.acidity is 7.90, the maximum volatile.acidity is 1.58, and the median pH is 3.31.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Quality appears to be roughly normally distributed over the discrete values 3, 4, 5, 6, 7, and 8.
About 680 wines have quality 5 and about 640 have quality 6, followed by 7, 4, 8, and finally 3 with the fewest wines.
We can group quality into three categories (bad, fair, and good) by creating a new categorical variable called quality_rating.
## bad fair good
## 63 1319 217
Here we have 63 bad wines, 1319 fair wines, and 217 good wines.
Fair is by far the most common rating.
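The grouping can be sketched in R with `cut()`. This is only an illustration: the original code is not shown, and the per-score counts below are an assumption chosen to be consistent with the table above.

```r
# Hypothetical quality vector: assumed per-score counts for scores 3..8.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are (2,4], (4,6], (6,8]: 3-4 -> bad, 5-6 -> fair, 7-8 -> good.
quality_rating <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("bad", "fair", "good"),
                      ordered_result = TRUE)

rating_counts <- table(quality_rating)
print(rating_counts)
```

Using an ordered factor keeps bad < fair < good in the right order in plots and tables.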
volatile.acidity has a long-tailed distribution. Let's transform it using a log10 scale.
On a log10 scale, volatile.acidity looks approximately normally distributed.
fixed.acidity has a long-tailed distribution.
Displaying fixed.acidity on a log10 scale reveals an approximately normal distribution.
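One way to check numerically that a log10 transform tames a long right tail is to compare the sample skewness before and after. The sketch below uses simulated values, not the wine data, and the `skewness` helper is a hypothetical function defined here for illustration.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Simulated long-tailed values (an assumption standing in for a wine variable).
x <- rlnorm(1000, meanlog = -0.7, sdlog = 0.4)

# Sample skewness: third standardized moment, near 0 for a symmetric shape.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # much closer to 0 after the transform

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero does not prove normality, but it supports what the histograms suggest.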
##
## FALSE TRUE
## 1467 132
As we can see, most wines have citric.acid between 0 and 0.5, and 132 red wines have a citric.acid value of 0.
citric.acid is not normally distributed.
Now I'll create a new variable representing the total fixed acids of the wine (fixed.acidity + citric.acid). Let's call it total.fixed.acids.
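As a minimal sketch of both steps, counting zero values and building the combined variable, the code below uses only the first five rows printed by `str()` earlier. The data frame name `wine` is an assumption.

```r
# First five rows of the two acid columns, as printed by str() above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  citric.acid   = c(0, 0, 0.04, 0.56, 0)
)

# How many wines have no citric acid at all?
zero_citric <- table(wine$citric.acid == 0)
print(zero_citric)

# New variable: sum of the two fixed acids.
wine$total.fixed.acids <- wine$fixed.acidity + wine$citric.acid
print(wine)
```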
The total.fixed.acids variable has a long-tailed distribution.
Plotting total.fixed.acids on a log10 scale reveals an approximately normal distribution.
residual.sugar has a heavy-tailed distribution with a lot of outliers.
Even on a log10 scale, residual.sugar remains heavy-tailed.
I thought of creating a new variable classifying red wines into two categories (sweet and non-sweet), but the maximum residual.sugar in the dataset is 15.5, whereas a wine is considered sweet only if its residual.sugar is at least 45.
This plot shows that most wines have a chlorides value below 0.2. It also shows that chlorides, like residual.sugar, has a heavy-tailed distribution with many outliers.
The distribution of free.sulfur.dioxide is not clear with the default bins. I need to adjust the binwidth to get a better visualization.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Adjusting the binwidth reveals that free.sulfur.dioxide has a right-skewed distribution.
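To illustrate what setting an explicit bin width does, here is a small base R sketch using the first ten free.sulfur.dioxide values printed by `str()` earlier. The bin width of 5 is an arbitrary choice for illustration, not the value used for the plot.

```r
# First ten free.sulfur.dioxide values from the str() output above.
x <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Explicit bin edges every 5 units; bins are (0,5], (5,10], ..., (25,30].
binwidth <- 5
h <- hist(x, breaks = seq(0, 30, by = binwidth), plot = FALSE)

print(h$counts)

# With ggplot2, the equivalent setting is geom_histogram(binwidth = 5),
# which is what the stat_bin message suggests adjusting.
```

Narrower bins show more detail but more noise, and wider bins smooth the shape, so it is worth trying a few widths.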
total.sulfur.dioxide has a right-skewed distribution, like free.sulfur.dioxide.
## Warning: position_stack requires constant width: output may be incorrect
density is clearly close to normally distributed.
pH has a normal distribution with a few outliers.
sulphates has a long-tailed distribution.
Transforming sulphates to a log10 scale shows an approximately normal distribution.
alcohol has a right-skewed distribution.
Even on a log10 scale, alcohol is still not normally distributed.
There are 1599 red wines in the dataset with 13 features (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality). All variables are num except X and quality, which are int.
Other observations: quality is a discrete variable, while all the others are continuous.
quality has a roughly normal distribution.
volatile.acidity, fixed.acidity, and sulphates appear normally distributed when plotted on a log10 scale.
chlorides and residual.sugar have heavy-tailed distributions with a lot of outliers.
free.sulfur.dioxide and total.sulfur.dioxide have long-tailed distributions.
density and pH are normally distributed.
At least 75% of red wines have a quality of 6 or less.
Many wines (132) have a citric.acid value of 0.
The minimum quality is 3 and the maximum quality is 8.
The median fixed.acidity is 7.90.
The maximum volatile.acidity is 1.58.
The median pH is 3.31.
The X variable is just an identifier for the observations.
I am most interested in the quality of red wine, and I want to explore the variables that affect it.
From googling and the variable descriptions, I think the variables below will support my investigation of the quality variable:
1- Acids [Fixed, Volatile and citric]
2- alcohol
3- pH
4- total sulfur dioxide
I created two new variables:
1- quality_rating: a categorical version of the quality variable
2- total.fixed.acids: the sum of the fixed and citric acids in the wine
There were long-tailed and heavy-tailed distributions in addition to normal ones. The only operations I performed on the data were setting the binwidth and applying log10 transformations to get better visualizations.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## total.fixed.acids 0.99704157 -0.294847154 0.72665884
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.fixed.acids 0.121334969 0.108045144 -0.148947646
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## total.fixed.acids -0.10127190 0.65737801 -0.68958445
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
## total.fixed.acids 0.202161691 -0.04578490 0.13852654
## total.fixed.acids
## fixed.acidity 0.9970416
## volatile.acidity -0.2948472
## citric.acid 0.7266588
## residual.sugar 0.1213350
## chlorides 0.1080451
## free.sulfur.dioxide -0.1489476
## total.sulfur.dioxide -0.1012719
## density 0.6573780
## pH -0.6895844
## sulphates 0.2021617
## alcohol -0.0457849
## quality 0.1385265
## total.fixed.acids 1.0000000
We can see that quality has a moderate positive correlation with alcohol (0.476) and a negative correlation with volatile.acidity (-0.391).
pH is strongly negatively correlated with both fixed.acidity (-0.683) and citric.acid (-0.542), which is a meaningful relationship given the description of pH.
Also note that free.sulfur.dioxide correlates with total.sulfur.dioxide (0.668), which makes sense because free sulfur dioxide is a component of total sulfur dioxide.
Finally, we can see that total.fixed.acids is correlated with fixed.acidity (0.997) and citric.acid (0.727), which is logical because total.fixed.acids is the sum of those two acids.
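A correlation matrix like the one above can be produced with `cor()`. The sketch below is an illustration on the first ten rows printed by `str()` earlier, so its values will not match the full-data matrix. The data frame name `wine` is an assumption.

```r
# First ten rows of three columns, as printed by str() above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)
wine$total.fixed.acids <- wine$fixed.acidity + wine$citric.acid

# Pearson correlation between every pair of columns (the default method).
cor_matrix <- cor(wine)
print(round(cor_matrix, 3))
```

Because total.fixed.acids is dominated by fixed.acidity, the two are almost perfectly correlated even in this tiny sample.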
This plot shows that high-quality wine has a high alcohol value. We can also notice the vertical strips, which indicate that quality is a discrete variable taking one of the values 3, 4, 5, 6, 7, and 8. The median alcohol increases across the higher qualities (6, 7, and 8). About 75% of high-quality wines have alcohol values exceeding 11, while for lower-quality wines the values are mostly under 11.
High-quality wine has low volatile.acidity, which matches the effect of high volatile.acidity on quality noted in the variable description.
Higher qualities have higher values of fixed.acidity.
This plot shows a clear positive impact of citric.acid on quality.
The plot doesn't show any impact of residual.sugar on quality. The median and quartile values of residual.sugar across the qualities are very close.
High-quality wine has low chlorides.
The plot shows that there is no linear relationship between free.sulfur.dioxide and quality.
It is only barely visible that high-quality wine has a lower total.sulfur.dioxide value.
Low qualities have high density values while high qualities have lower ones, so we can say that high-quality wine has low density.
It seems that high-quality wine has a low pH value.
High-quality wines have higher sulphates values than low-quality wines.
High-quality wines have high total.fixed.acids values.
Bad and fair wines have about the same median alcohol value, but good wines have a higher median.
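The per-group medians and quartiles read off these box plots can also be computed directly. A minimal sketch with `tapply()`, using the first ten alcohol and quality values printed by `str()` earlier (so the numbers are illustrative only):

```r
# First ten rows, as printed by str() above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol within each quality score.
median_by_quality <- tapply(alcohol, quality, median)

# Third quartile within each quality score (extra args go to quantile()).
q3_by_quality <- tapply(alcohol, quality, quantile, probs = 0.75)

print(median_by_quality)
print(q3_by_quality)
```

The same call works with quality_rating as the grouping variable to compare bad, fair, and good wines.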
This plot shows that good wines have high values of both fixed.acidity and citric.acid, and low values of volatile.acidity.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
The plot shows an exponential relationship between total.fixed.acids and pH.
The plot reveals a linear relationship between total.fixed.acids and fixed.acidity
quality correlates moderately with alcohol and volatile.acidity: high-quality wines have high alcohol and low volatile.acidity.
quality has a low positive correlation with fixed.acidity, citric.acid, sulphates, and total.fixed.acids. It has a low negative correlation with density, total.sulfur.dioxide, and chlorides.
quality has only a very weak correlation with pH, residual.sugar, and free.sulfur.dioxide.
Yes, I observed the relationship between total.fixed.acids and pH, which seems to be exponential with a strong negative correlation (-0.690). This is logical, as pH is a measure of acidity.
I also observed a linear relationship between fixed.acidity and total.fixed.acids.
The strongest relationship is the one between fixed.acidity and total.fixed.acids (correlation 0.997).
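Since pH is defined as the negative log10 of hydrogen-ion activity, one way to probe the curved acid-versus-pH relationship is to regress pH on the log10 of the acid variable and see whether a straight line fits. The sketch below uses the first ten rows from the `str()` output earlier and is only an illustration, not a fit to the full data.

```r
# total.fixed.acids = fixed.acidity + citric.acid for the first ten rows.
total.fixed.acids <- c(7.4, 7.8, 7.84, 11.76, 7.4, 7.4, 7.96, 7.3, 7.82, 7.86)
pH <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Linear model of pH against log10 of the acid content.
fit <- lm(pH ~ log10(total.fixed.acids))

# Slope of the fitted line; a negative value means more acid, lower pH.
slope <- unname(coef(fit)[2])
print(slope)
```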
In this section I'll explore the most interesting variables that may affect quality, in conjunction with the quality and quality_rating variables.
This plot shows a weak negative correlation between volatile.acidity and alcohol (-0.202). We can see that good wines have high alcohol and low volatile.acidity.
There is a strong relationship between fixed.acidity and citric.acid (0.672). It is also clear that good wines have high values of both fixed.acidity and citric.acid.
The strong relationship between free.sulfur.dioxide and total.sulfur.dioxide is clear. We can also see that free.sulfur.dioxide has almost no effect on quality: all qualities cover almost the same range of free.sulfur.dioxide values. For total.sulfur.dioxide, we can only barely see that high-quality wines have high values.
This plot shows the relationships between pH and both fixed.acidity and total.fixed.acids, which seem to be strong.
As fixed.acidity or total.fixed.acids increases, pH decreases. It is also clear that good wines have high values of fixed.acidity and total.fixed.acids, and low values of pH.
There is a weak correlation between alcohol and pH (0.206). Good wines have higher alcohol and lower pH values than bad and fair wines.
This plot shows no relationship between alcohol and total.fixed.acids, but it reveals that good wines have high values of both.
By faceting the plots by quality rating, I can visualize the relationships between many variables and their impact on quality.
Starting with alcohol, which has the highest correlation with quality, I notice that as alcohol increases, volatile.acidity decreases and wine quality increases. This is meaningful, as we know from the variable description that good wines have low volatile.acidity.
There is no relationship between alcohol and total.fixed.acids, but both variables correlate with quality. Good wines have high values of alcohol and total.fixed.acids.
fixed.acidity and citric.acid are correlated with each other and have a small impact on quality. Good wines have high values of both.
free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated. We can see no impact of free.sulfur.dioxide on quality, while total.sulfur.dioxide appears to have a small positive impact in these plots, even though its overall correlation with quality is weakly negative (-0.185).
Finally, pH seems to have a strong correlation with both fixed.acidity and total.fixed.acids, which is meaningful because pH is a measure of acidity and fixed.acidity is a component of total.fixed.acids. pH also has a weak correlation with alcohol.
From googling the pH variable, I learned that it is almost the backbone of wine quality, but surprisingly I found that the relationship between quality and pH is extremely weak.
No
This plot shows wine quality, which takes ordered discrete values from 3 to 8, with bars filled by quality rating, which groups the quality values into three categories [bad, fair, and good].
We can see that the most common rating in the dataset is fair [5 and 6], then good [7 and 8], and the least common is bad [3 and 4].
This plot shows the impact of acids on quality. We can clearly see that good wines have low volatile.acidity and high values of both fixed and citric acids.
This plot demonstrates the impact of alcohol on quality: good wines have high alcohol values.
This dataset contains 1599 observations of 15 variables (the 13 original variables plus the two I created), including the quality variable, which is my feature of interest. I began by exploring all the variables individually.
After that I created a new categorical variable, ‘quality_rating’, which expresses quality in meaningful terms rather than as numbers. I also combined the fixed acid variables (fixed.acidity and citric.acid) into one variable, total.fixed.acids.
I explored the quality variable across all the other variables to understand the impact of the variables on quality.
I was able to determine that the main features affecting quality are alcohol and acids. There are also other features with a low impact, such as sulphates, density, and pH.
I would be interested in creating a linear model and testing its accuracy.
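As a rough sketch of what such a model could look like, the code below fits `lm()` on the two strongest predictors found above and reports the in-sample root mean squared error. It uses only the first ten rows from the `str()` output, treats quality as numeric (a simplifying assumption, since it is really ordinal), and is not a result of this analysis.

```r
# First ten rows as printed by str() above; far too few for a real model.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Linear model with the two predictors most correlated with quality.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)

# Root mean squared error on the rows used for fitting.
rmse <- sqrt(mean(residuals(fit)^2))

print(coef(fit))
print(rmse)
```

A fair accuracy test would fit on one part of the 1599 rows and compute the error on the held-out remainder.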